There is a Stack Overflow dataset available with the following data:
CreationDate
: the timestamp of the creation of a Stack Overflow post (= question)
TagName
: the tag name for a technology (in our case only four VCSes: "cvs", "svn", "git" and "mercurial")
ViewCount
: the number of views of a post

These are the first 10 lines of this dataset (a header row followed by nine entries):
CreationDate,TagName,ViewCount
2008-08-01 13:56:33,svn,10880
2008-08-01 14:41:24,svn,55075
2008-08-01 15:22:29,svn,15144
2008-08-01 18:00:13,svn,8010
2008-08-01 18:33:08,svn,92006
2008-08-01 23:29:32,svn,2444
2008-08-03 22:38:29,svn,871830
2008-08-03 22:38:29,git,871830
2008-08-04 11:37:24,svn,17969
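To follow along without the compressed file, the sample rows listed above can be loaded from an in-memory string. This is only a sketch: the names `SAMPLE_CSV` and `sample` are illustrative and not part of the original notebook, and the nine rows are not the full dataset.

```python
import io

import pandas as pd

# The nine sample rows shown above, embedded as text (not the full dataset).
SAMPLE_CSV = """CreationDate,TagName,ViewCount
2008-08-01 13:56:33,svn,10880
2008-08-01 14:41:24,svn,55075
2008-08-01 15:22:29,svn,15144
2008-08-01 18:00:13,svn,8010
2008-08-01 18:33:08,svn,92006
2008-08-01 23:29:32,svn,2444
2008-08-03 22:38:29,svn,871830
2008-08-03 22:38:29,git,871830
2008-08-04 11:37:24,svn,17969
"""

# parse_dates converts the CreationDate column to timestamps while reading.
sample = pd.read_csv(io.StringIO(SAMPLE_CSV), parse_dates=["CreationDate"])

print(sample.dtypes)
print(sample["TagName"].value_counts())
```

Note that the two rows stamped 2008-08-03 22:38:29 share the same timestamp and view count but carry different tags ("svn" and "git"), so the same views appear once per tag.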
In [1]:
import pandas as pd
# Read the gzip-compressed CSV (pandas infers the compression from the file extension).
vcs_data = pd.read_csv('../dataset/stackoverflow_vcs_data_subset.gz')
vcs_data.head()
Out[1]:
In [2]:
# Convert the CreationDate strings to pandas timestamps.
vcs_data['CreationDate'] = pd.to_datetime(vcs_data['CreationDate'])
vcs_data.head()
Out[2]:
In [3]:
# Sum the view counts per timestamp and tag.
number_of_views = vcs_data.groupby(['CreationDate', 'TagName']).sum()
number_of_views.head()
Out[3]:
In [4]:
# Pivot the TagName index level into columns: one ViewCount column per VCS.
views_per_vcs = number_of_views.unstack()['ViewCount']
views_per_vcs.head()
Out[4]:
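Because the cell outputs are not shown here, the following sketch illustrates what the grouping and unstacking steps produce, using four of the sample rows from the top of this notebook. The variable names `rows`, `summed` and `wide` are illustrative only.

```python
import pandas as pd

# Four of the sample rows listed at the top of the notebook.
rows = pd.DataFrame({
    "CreationDate": pd.to_datetime([
        "2008-08-01 23:29:32",
        "2008-08-03 22:38:29",
        "2008-08-03 22:38:29",
        "2008-08-04 11:37:24",
    ]),
    "TagName": ["svn", "svn", "git", "svn"],
    "ViewCount": [2444, 871830, 871830, 17969],
})

# Step 1: sum views per (timestamp, tag); the result has a two-level index.
summed = rows.groupby(["CreationDate", "TagName"]).sum()

# Step 2: move the innermost index level (TagName) into the columns,
# then select the ViewCount block so the columns are just the tag names.
wide = summed.unstack()["ViewCount"]

print(wide)
```

Timestamps without a post for a given tag end up as `NaN` in that tag's column, which is why most "git" cells are empty in this small example.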
In [5]:
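# Sum the views per calendar month, then accumulate them over time.
# Note: pandas 2.2+ deprecates the "M" alias in favor of "ME" (month end).
monthly_views = views_per_vcs.resample("1M").sum().cumsum()
monthly_views.head()
Out[5]:
In [6]:
%matplotlib inline
monthly_views.plot(title="cumulative monthly Stack Overflow post views");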
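The resampling step combines two operations: a monthly sum followed by a running total, so the plotted curves never decrease. The sketch below shows this on made-up numbers (they are not taken from the dataset). It uses the `"MS"` (month start) alias as an assumption for portability across pandas versions; the bins are the same calendar months as with `"1M"`, only the labels sit at the start of each month instead of the end.

```python
import pandas as pd

# Made-up view counts (NOT from the dataset), indexed by post timestamp.
views = pd.DataFrame(
    {"git": [10.0, None, 5.0, 20.0], "svn": [1.0, 2.0, None, 4.0]},
    index=pd.to_datetime([
        "2008-08-01 10:00",
        "2008-08-15 12:00",
        "2008-09-03 09:00",
        "2008-10-20 18:00",
    ]),
)

# Monthly totals: NaN values are skipped, and a month with no values sums to 0.
monthly = views.resample("MS").sum()

# Running total over the months, computed per column.
cumulative = monthly.cumsum()

print(cumulative)
```

In this example "svn" has no views in September, so its cumulative value stays flat for that month rather than dropping.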